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ARTICLE 



Kernel canonical correlation analysis for assessing 
gene-gene interactions and application to ovarian 
cancer 

Nicholas B Larson\ Gregory D Jenkins\ Melissa C Larson\ Robert A Vierkant\ Thomas A Sellers^, 
Catherine M Phelan^, Joellen M Schildkraut^, Rebecca Sutphen"*, Paul PD Pharoah^, Simon A Gayther^, 
Nicolas Wentzensen'', Ovarian Cancer Association Consortium, Ellen L Goode^ and Brooke L Fridley"^'^'^ 

Although single-locus approaches have been widely applied to identify disease-associated single-nucleotide polymorphisms 
(SNPs), complex diseases are thought to be the product of multiple interactions between loci. This has led to the recent 
development of statistical methods for detecting statistical interactions between two loci. Canonical correlation analysis (CCA) 
has previously been proposed to detect gene-gene coassociation. However, this approach is limited to detecting linear relations 
and can only be applied when the number of observations exceeds the number of SNPs in a gene. This limitation is particularly 
important for next-generation sequencing, which could yield a large number of novel variants on a limited number of subjects. 
To overcome these limitations, we propose an approach to detect gene-gene interactions on the basis of a kernel ized version of 
CCA (KCCA). Our simulation studies showed that KCCA controls the Type-I error, and is more powerful than leading gene-based 
approaches under a disease model with negligible marginal effects. To demonstrate the utility of our approach, we also applied 
KCCA to assess interactions between 200 genes in the NF-kB pathway in relation to ovarian cancer risk in 3869 cases and 
3276 controls. We identified 13 significant gene pairs relevant to ovarian cancer risk (local false discovery rate <0.05). Finally, 
we discuss the advantages of KCCA in gene-gene interaction analysis and its future role in genetic association studies. 
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INTRODUCTION 

Genome- wide association studies (GWAS) have identified hundreds 
of loci that harbor genetic variants that influence predisposition to a 
particular phenotype. Such studies involve the characterization of a 
large number of single-nucleotide polymorphisms (SNPs) across the 
genome and comparison of the frequency of variants across disease 
states. Initial GWAS analysis strategies involved single locus models, 
whereby individual markers were tested independently for association 
with a given phenotype. Although this approach has successfully 
identified regions of disease susceptibility, some contend it has failed 
to fuUy explain the heritability of complex phenotypes.^'^ As common 
complex diseases and traits are thought to be a result of complex 
interactions and multiple low-penetrance variants,^'^ multi-locus SNP 
models, as opposed to single SNP models, may better capture the true 
underlying genotypic-phenotypic relationship. 

One strategy for multi-locus modeling is to jointly model the 
effects all SNPs within a given gene (eg, multivariable logistic 
regression models). However, this approach may lack power as the 
degrees of freedom of the model could be large and may require 
filtering or shrinkage approaches. Another drawback to the joint 
modeling of multiple SNPs within a gene is possible model fitting 
issues due to multicoUinearity between SNPs (ie, linkage 



disequilibrium (LD)), as well as the lack of inclusion of LD 
information in the analysis. Recently, this idea of gene-level analysis 
has led to the concept of gene-gene interaction analysis, as opposed to 
SNP-SNP interaction approaches. Gene-gene interactions are not 
only hypothesized to have a large role in explaining missing 
heritability,^ they can also serve to provide biological information 
through construction of novel gene pathway topologies. Although 
classically gene-gene interactions have been defined statistically as 
deviance from additive marginal effects, such as in the case of logistic 
regression model, this type of model is limiting with respect to 
statistical power. Moreover, the results of such SNP-SNP interaction 
analyses lack clear biological interpretability. 

Zhao et af proposed testing for interactions between two unlinked 
loci using measures of LD, which can be extended to case- control 
design by comparing such measures across case or disease status. This 
concept was adapted to gene-level analysis by Peng et alj which used 
a Wald-type (7-statistic based on canonical correlation analysis^ 
(CCA) to detect gene-gene coassociation in case-control studies. As 
an LD-based procedure, the CCA approach obtains the maximal 
correlation of the linear combinations of the SNPs, coded as 0, 1 or 2 
in terms of the minor allele, between two genes across case-control 
status, and tests whether the difference in the first canonical 
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correlations is statistically significant. Although there are many 
benefits to this approach, it is limited to analyses where the number 
of observations exceeds the number of markers. Moreover, the use of 
CCA can only be used to detect linear relationships, which may limit 
power in the presence of nonlinear correlations between genes. 
Finally, CCA generally requires a large sample-to-feature ratio to 
avoid issues with overfitting the data, which raises questions of model 
regularization. 

One solution to the above limitations involving the use of CCA for 
assessing gene-gene interactions is the use of kernels. Kernel methods 
are generally defined as algorithms that analyze data represented by 
similarity matrices, which are derived through the use of positive 
definite kernel functions.^ By mapping the original data to a 
nonlinear feature space, traditional linear methods involving dot 
products have been extended to nonlinear applications through the 
use of the 'kernel trick'. The application of kernel machines is quite 
popular as a method to derive metrics of genomic similarity,^ ^ and 
their use has been successful in the area of gene-level association 
analyses such as SKAT^^'^^ and SPA-SG.^"^ Kernelized version of 
CCA (KCCA) provides a straightforward generalization of CCA to 
nonlinear correlations by applying CCA to kernel-generated feature 
spaces. 

In this article, we develop a KCCA procedure for identifying 
coassociation between genes using genome-wide SNP data from a 
case-control study of complex phenotypes. We briefly discuss sample 
CCA and its kernelized version. We further outline and address the 
statistical and computational issues that accompany this approach, 
including concerns of regularization. To evaluate the properties of this 
method, we examine control of Type-I error rate and the power of 
our KCCA method compared with other existing methods for gene- 
gene interaction detection by a simulation study. Finally, we apply our 
KCCA approach to a study case- control study of invasive epitheHal 
ovarian cancer to determine gene-gene interactions between genes 
within the NF-kB gene pathway. 

MATERIALS AND METHODS 

Data definition 

Let Sg be the number of SNPs corresponding to the gth gene in a given gene 
list of size G. Define to be the genotype value for the jth SNP in gene g in 
the kth case subject, for g = 1 , ... , G, 7 = 1 , ... , Sg, and /c = 1 , ... , where 
x^j^ G {0, 1, 2} is the number of copies of the minor allele for SNP j in gene g 
under an assumed joint additive-additive genetic model. Similarly, we define 
x^j^ for the control subjects, where k=\, . . . Nq. This genotypic information 
can in turn be represented by the respective N^xsg and NcXSg matrices 
and Xg . These matrices may also contain adjusted genotype values that have 
been corrected for various covariates and population stratification. 

Hypothesis testing 

To test whether there is a statistical interaction between two genes across case- 
control status, we use KCCA to generate measures of genetic coassociation for 
both case and control status. For given genes l,me{\,...,G}, such that l^m, 
consider the genotype matrices X{^, X^, Xf, and X^, with corresponding 
reduced kernel representations K^*, K^*, K^^* , and K^*. Define 
r^^ = CCA(Kf*, K^*) and rf^ = CCA(Kf , K^*) to be the respective maximal 
kernel canonical correlations for cases and controls between genes / and m 
(see Appendices I-III in Supplementary Material for details). We then define a 
statistic based upon an analog of the Fisher variance stabilizing transformation 
of the Pearson's correlation coefficient/^ given as 

z(r)=^(log(l + r)-log(l-r)). 

The transformation of the canonical correlation rf^ is written as z[r^^) =^/m' 
which is approximately distributed as standard normal. A Wald-type statistic 



for assessing the statistical significance of the difference in gene-gene 
coassociation between cases and controls for genes / and m is defined by 
Peng et af as 

nn Im Im 

which is asymptotically distributed as N{0, 1) under the null hypothesis that 
^/m^^/m' ^^^^^ controls are independent. For estimating the 

transform variances var(zj;^) and var(zjjjj), we apply a robust resampling 
procedure, the trimmed jackknife^^ (Appendix W in Supplementary Materials). 

Multiple testing 

For applications involving exhaustive hypothesis testing of all pairwise gene 
comparisons in a given gene list, multiple testing and test statistic correlation 
become problematic issues. To address both of these directly, we apply Efron's 
empirical nuU method^^ for estimating the local false discovery rate (IFDR) for 
each hypothesis test conducted. 

Simulation study 

To assess the properties of our KCCA procedure for gene-gene interaction 
testing, we consider simulation studies that evaluate type- 1 error control and 
power. We generated a population of haplotypes for two genes using real 
genotype data from our case study, fastPHASE^^ was applied to the genotypes 
from the controls to estimate haplotypes for two genes of comparable size 
(25,9 and 30,6 kb), followed by use of HapSim^^ to simulate 10000 haplotypes 
for each gene. The respective numbers of polymorphic sites for each gene were 
79 and 92, Genotype data for a hypothetical individual were simulated by 
combining pairs of randomly selected haplotypes for each gene. 

Let rUgj represent the derived minor allele fi*equency (MAF) fi-om our 
simulated haplotype populations for marker j in gene /, Next, we randomly 
selected a fixed number of common {nigj > 0.05) markers to be causal for each 
gene. We then used a similar approach to effect size definition used by Wu 
et al}^ in which effects are a function of the MAF, Let Pi^t{rnij^ ^if) for an 
interaction effect, such that j6jnt(wij, m2/) = |log^Q(mijm2/) |, Here i 
defines the maximum possible interaction effect. For example, for two 
markers with MAF = 0,20 and t = 5, the interaction effect is 
ft„,(miy,m2/) = 0.14L 

Given the difficulty in genome-wide detection of gene-gene interactions 
without the presence of marginal effects, we only considered disease models 
with solely epistatic effects. We defined the probability of being a case 
conditional on genotype P(7= 1 | Xi,X2) via a logistic regression frame- 
work, such that 

logit(P(7= 1 |Xi,X2)) =^0 + ^int(^ii, rn2f)x,jX2f 

where Qi and Q2 define the subsets of markers which are causal, and Pq 
represents the disease prevalence. Sampling of cases and controls was then 
completed from a sufficiently large number of simulated genotype-phenotype 
pairs. 

To comparatively evaluate the performance of our KCCA method, we 
included additional methods on the basis of similar analysis principles: the 
original CCA-based approach, PC-based logistic regression^^ (PC-LR), and a 
composite-LD method^^ (CLD), PC-LR obtains principal components from 
SNP measurements for each gene, which are then fit in simple logistic 
regression. The CLD method is a covariance-based approach, which evaluates 
the difference between block interactions across case-control status. Additional 
details for each approach can be found in their respective publications. For 
PC-LR, we evaluate the significance of the interaction coefficient between the 
first principal component of each gene, and for the CLD approach we use 5000 
permutations to characterize the reference distribution of the test statistic, AU 
declarations of statistical significance are made at an a-level of 0,05, For both 
Type-I error and power simulations, we consider whether or not explicit 
marginal effects are included in the disease model. Each simulation scenario is 
conducted with case-controls status sample sizes of 500, 1000, and 1500, with a 
total of 1000 iterations each. 
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Ovarian cancer study 

We applied the KCCA approach to detect gene-gene interaction within the 
NF-kB pathway, using data from a case-control study of invasive epithelial 
ovarian cancer as part of the GAME-ON FoUow-up Ovarian Cancer Genetic 
Association and Interaction Studies collaboration (described elsewhere^^'^^). 
Participants were enrolled in the Mayo Clinic Ovarian Cancer Study, North 
Carolina Ovarian Cancer Study, the Tampa Bay Ovarian Cancer Study, the 
Toronto Ovarian Cancer Study, the National Cancer Institute Ovarian Case- 
Control Study in Poland, the UK Ovarian Cancer Population Study, the 
Studies of Epidemiology and Risk Factors in Cancer Heredity Ovarian Cancer 
Study, the Familial Ovarian Cancer Registry Study, and the Royal Marsden 
Hospital Ovarian Cancer Study^^'^^ Study protocols were approved by the 
appropriate institutional review board or ethics panel, and all patients 
provided written informed consent. Genotypes were from the lUumina (San 
Diego, CA, USA) 610-Quad SNP arrays, with imputation to HapMap v 26 
using MACH,^^ For 200 autosomal genes within the NF-kB pathway, this 
resulted in ~ 13 000 observed or imputed markers available on 3869 cases and 
3276 controls of European descent. For genotyped markers, we coded 
genotypes as 0, 1, or 2 in terms of the number of observed minor alleles; 
for imputed markers, we used the expected genotype or 'dosage'. Marker 
assignment to genes was determined on the basis of NCBI-build 36 gene 
location data, using a 20-kb buffer region on both the 5' and 3' ends of the 
defined gene location. Information on location, size, and number of SNPs in 
each of the 200 autosomal genes can be found in Supplementary Table SI. 

To address the effects of possible confounding variables, we adjusted the 
genotypes for age, study site, and the first five principal components from an 
eigen analysis.^^ Each unique gene pair was tested using the KCCA procedure. 



resulting in 




= 19900 total hypothesis tests. For purposes of 



comparison, we also appHed the CCA-based procedure defined by Peng 
et af to the data, using 1000 bootstraps for variance estimation. Comparisons 
involving gene pairs with overlapping regions were removed from our analysis 
to avoid complications involving shared marker data. 

RESULTS 

Type-I error 

For our simulations, we considered six levels of trimming within the 
jackknife procedure for SE estimation to determine which level was 
appropriate. A plot of the empirical Type-I error rate for all trim levels 
across each sample size are found in Figure 1. The KCCA results 
derived from co = 0.15 yield near nominal Type-I error rate levels 
across all sample sizes. Detailed statistics on the empirical distribu- 
tions of the test statistics can be found in Table 1. These results 
indicate that the test statistic follows the assumed standard normal 
distribution under the null. 
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Figure 1 Line plot of the empirical Type-I error rates from the null 
simulations across trim levels a> = 0.00, 0.05, 0.10, 0.15, 0.20, and 0.25, 
for sample sizes of 500, 1000, and 1500. 



Power 

For our power simulations, we set t = 5 and randomly selected five 
markers from each gene to be causal at each iteration, performing aU 
KCCA tests with co = 0.15. Bar graphs of the results are found in 
Figure 2. From this plot, we observe that KCCA outperforms the 
other methods across all sample sizes, particularly the CCA approach. 
As the poor performance of CCA is likely due to issues with 
overfitting, we considered additional simulations where sample sizes 
were fixed at 1000 and the number of markers per gene was set to 
values of 10, 20, 30, 40, or 50, randomly subsetted from the total 
number of markers used in our simulations. Keeping all the other 
settings of our simulation design the same, the empirical power for 
each method across number of markers per gene is found in Table 2. 
Although the resulting power is much closer for low marker numbers, 
the KCCA approach is stiU consistently more powerful than the 
remaining methods. 

Application to ovarian cancer study 

For the ovarian cancer study of the NF-kB pathway, we identified 13 
statistically significant (1FDR<0.05) gene pairs of interest (Table 3) in 
applying our KCCA method, using the trimmed jackknife SE 
estimate with co = 0.15, with the top coassociated gene pair with 
case-control status occurring between CASP8 and MAP3K3. Apphca- 
tion of the CCA-based procedure resulted in 37 significant gene pairs; 
however, none of these overlapped with those detected by the KCCA 
method. 

To explore one of the top coassociation hits at the SNP level, we 
analyzed the CASP8-MAP3K3 interaction by generating pairwise 
marker Pearson's correlations. These are presented by case-control 
status in Figure 3 as gradient colorized images of the correlation 
coefficients. The figure demonstrates that although the Pearson's 
correlation coefficients themselves are small (|p|<0.10), there exist 
distinct differences between the two correlation structures. Of note is 
the discrepancy across case-control status with the correlation 
between marker rsl2940055 in MAP3K3 (located at bp 59 075 874) 
and a large number of SNPs toward the 5' end of CASP8, indicated by 
the horizontal band in the lower half of the correlation plots. 

DISCUSSION 

The identification of interactions between genes and their impact on 
complex diseases is adding to our understanding the genetic 
component. Although SNP-based interaction analysis methods are 
relatively well developed, the large number of SNPs in association 
studies makes exhaustive pairwise SNP- SNP analyses increasingly 
infeasible. By addressing the problem at the gene level, the scope of 
the analyses is not only computationally tenable, but also reduced to a 
biologically interpretable unit of interest.^^ Peng et af presented a 
novel statistical method that allows for such analyses in their CCA 
statistic. However, as the number of genotyped markers increases, the 
apphcation of CCA may be inappropriate, because the number of 
genotyped SNPs may approach or exceed the number of observations, 
especially for smaller experimental designs. This is of particular 
concern with post-GWAS whole exome and genome sequencing 
association studies, which will afford the characterization of 
additional variants unmeasured by current genotyping platforms. 

In this article, we have presented a KCCA for gene-gene interaction 
analysis that not only addresses concerns of dimensionality, but also 
allows the flexibility to detect nonlinear correlations between genes. 
We have demonstrated that, using the appropriate SE estimation 
procedure, the KCCA test statistic exhibits near nominal levels of 
type-I error rate control and competitive power performance in our 
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Table 1 KCCA null simulation results for various trimming values co, which includes the P-value for the Kolmogorov-Smirnov test for 
normality (KS), the empirical SD and mean of the simulated test statistic distribution, as well as the realized Type-! error rate rejecting at an 
a-level of 0.05 
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Figure 2 Barplots of the empirical power at a-level of 0.05 for the KCCA, 
CCA, PC-LR, and CLD gene-gene interaction methods, for sample sizes of 
500, 1000, and 1500. 



data simulations. We have also outlined basic procedures for large- 
scale application that take into account issues of regularization, 
computational burden, and multiple testing. As a result, the KCCA 
procedure is a powerful tool for exploratory gene-gene interaction 
analysis using SNP data. 

It is important to note that although we have shown via our 
simulation results that the performance of the gene-level interaction 
analyses using KCCA is more powerful than other current methods 
under an interaction-only model, additional simulations have shown 
that the PC-LR approach performs best in the presence of marginal 
effects (Figure 4). Thus, if evidence suggests the existence of such 
effects, we recommend the use of the PC-based procedure instead of 
KCCA. Also, if we collapse the signal into a single interaction between 
two markers with no LD present, individual SNP-SNP interaction 
analysis via logistic regression easily bests the gene-level analyses in 
interaction detection. Thus, although we have demonstrated that our 
KCCA method performs well as a genome-wide exploratory meth- 
odology, alternative methods may perform better under specific 
circumstances. 

Although our data application analysis was conducted on genes 
within a specific pathway, we envision genome-wide exploratory 
applications, such as GWAS, to be completed in a similar fashion. 
However, owing to combinatorial scaling, this will require special 
computational considerations such as parallelization. For example, 
exhaustive pairwise analysis of a 20 000-gene genome requires nearly 



Table 2 Simulation result for empirical power, using a = 0.05 level 
significance testing, for the KCCA, CCA, PC-LR, and CLD testing 
procedures across varying numbers of markers per gene 

Total markers per gene 





10 


20 


30 


40 


50 


KCCA 


0.852 


0.818 


0.799 


0.779 


0.757 


CCA 


0.428 


0.097 


0.012 


0.002 


0.000 


PC-LR 


0.570 


0.575 


0.552 


0.545 


0.544 


CLD 


0.724 


0.570 


0.371 


0.305 


0.234 



Sample sizes are fixed at 1000, with five causal markers per gene. 

200 million unique tests, although this total pales in comparison with 
the number of possible SNP-SNP interactions. 



Comparability with CCA statistic 

Contrary to the findings of Peng et alj we have found that hypothesis 
testing using their CCA statistic with bootstrap variance estimation 
can be quite conservative, regardless of sample or feature size used, 
resulting in possible issues with reduced statistical power. This is 
evidenced in our power simulations where CCA power results for 
certain simulation conditions results in power below even nominal 
Type- 1 error rates. Upon investigation of this discrepancy, we found 
that the bootstrap variance estimates were quite large in comparison 
with their KCCA counterparts. 

An additional benefit of our procedure is that the assumptions of 
CCA include multivariate normality of the observations, which is 
clearly violated by the discrete nature of genotype calls if no 
adjustments are made. Our method, however, involves low dimen- 
sional projections of kernelized observations, which has been shown 
to be approximately Gaussian.^^ Thus, our KCCA procedure is also 
more consistent with the distributional assumptions of CCA. 



Ovarian cancer findings 

The analysis of the FOCI data using the KCCA procedure yielded 
biologically interesting results, with many of the top gene pairs 
sharing some functional basis. CASP8 and MAP3K3 are integral 
members of the tumor necrosis factor pathway, and ILIA and ILIB 
both code for proinflammatory cytokines and have been jointly 
associated with lung cancer.^^ With 13 statistically significant findings, 
there are also several novel interactions that may warrant further 
investigation. 
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Table 3 Detailed results of the significant KCCA gene-gene coassociations for analysis of ovarian cancer risk of significant (IFDR<0.05) 
gene-gene interactions from FOCI analysis ranked by estimated IFDR 



KCCA CCA 



Gene 1 


Gene 2 


^12 


^12 
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P-value 


IFDR 


Tl2 


P-value 
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MAP3K3 


0.0875 


0.0074 


14.2897 


<1.00E-16 


3.38E-21 


-0.7248 


0.46853 
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1.7243 
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0.0317 
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<1.00E-16 


2.59E-11 


-0.2520 


0.80098 


BIRC3 


MAP3K3 


0.0763 


0.0194 


5.7655 


8.14E-09 


0.028977 


0.7772 


0.43702 


MAP3K1 


TRAF7 


0.1125 


0.0465 


5.5000 


3.80E-08 


0.049916 


-1.0316 


0.30223 


BIRC3 


TNFRSF13C 


0.0601 


0.0001 


10.8523 


<1.00E-16 


8.74E-12 


-0.1986 


0.84251 


MAP3K1 


MAP3K8 


0.1032 


0.0377 


6.1801 


6.41E-10 


0.010925 


0.3529 


0.72412 


PYCARD 


TAF4 


0.0231 


0.1012 


-6.5902 


4.39E-11 


0.016345 


0.0313 


0.97502 



Abbreviations: CCA, canonical correlation analysis; FOCI, Follow-up Ovarian Cancer Genetic Association and Interaction Studies; KCCA, kernelized version of CCA; IFDR, local false discovery rate. 
Includes the Fisher-transformed maximal kernel canonical correlation values for cases and controls, the resulting test statistic, P-values, and IFDR estimates for KCCA, as well as results for the 
CCA analysis. 
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Figure 3 Colorized image plot of Pearson's correlation values between SNPs for CASP8-MAP3K3 coassociation for cases (left) and controls (right). The axes 
depict the genomic position of the markers on the respective genes. 
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Figure 4 Barplots of the empirical power at a-level of 0.05 for the KCCA, 
CCA, PC-LR, and CLD gene-gene interaction methods, under sample sizes 
of 500, 1000, and 1500 with the inclusion of statistically significant 
marginal effects. 



We argue the lack of congmenq^ between the results of the 
CCA-based procedure and our kernelized version is due, in large 
part, to the lack of any dimensional reduction in the CCA. Although 
the sample sizes used in our ovarian application analyses are large 
enough to satisfy most conventional notions of appropriate observa- 
tion-to-feature ratios for accurate canonical correlation estimation, 
our simulations indicate that the CCA method suffers greatly from 
overfitting when there are a relatively large number of genotyped 
markers in given genes. 

One significant gene pair of concern in the NF-kB-KCCA inter- 
action analysis is ILIA-ILIB, which includes genes that are position- 
ally adjacent to one another. Taking into account their respective 
buffer regions, only 4.4 kb separates the two genes. The significance of 
this gene pair could be evidence of instability of the method for gene 
pairs that are in high LD with each other because of locational 
proximity. However, gene pairs of this nature are also often closely 
linked functionally. Moreover, there are counter examples in our 
analysis of neighboring gene pairs that are statistically not significant 
(eg, LTBR-TNFRSFIA). Regardless, we recommend caution in the 
interpretation of such results. 
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Conclusions and future development 

Our KCCA algorithm simultaneously supplies dimensionality reduc- 
tion and nonlinear coassociation analysis for high- dimensional SNP 
data, providing a powerful framework for detecting statistical epistasis 
at the gene level. Moreover, this type of analysis can isolate gene pairs 
of interest for follow-up analysis without being burdened by the 
multiple testing corrections necessary for genome-wide SNP-SNP 
pairwise interaction analysis. This is particularly relevant for 
next- generation sequencing applications, which may interrogate aU 
possible SNPs through whole genome sequencing. 

Although we have argued that the KCCA procedure for detecting 
gene-gene interaction possesses many advantages over the previously 
proposed CCA statistic, there is also room for improvement and 
generalizability of our approach. The use of the Gaussian kernel 
function is a robust selection; however, other kernel functions may be 
more appropriate for the specific data type, particularly if there are no 
adjustments for covariate data.^^ The procedure itself may also be 
modified in a variety of ways, including the use of sparse canonical 
correlation^^'^^ and multigene interaction analysis with generalized 
canonical correlation,^^ and further exploring the resampling 
procedures used in the SE estimation. Finally, a less computa- 
tionally intensive alternative to KCCA may be a kernelized variant 
of principal correlation,^^ which could be considered for more 
demanding analyses such as genome-wide interrogation. 
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